Conversation
solsson
left a comment
Requesting changes (I created the PR so I can't reject it formally).
Only the core changes are reviewed here. I will look at the benchmark setup for long-term storage later.
- OpenMetricsText0.0.1
- PrometheusProto
- PrometheusText1.0.0
- PrometheusText0.0.4
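For reference, the values above are identifiers for Prometheus's `scrape_protocols` setting. A minimal sketch that keeps only the modern formats (the identifiers are real Prometheus values; dropping the legacy ones is this review's suggestion, not the current config):

```yaml
# prometheus.yml fragment: restrict content negotiation to modern
# exposition formats, omitting the legacy text versions.
global:
  scrape_protocols:
    - OpenMetricsText1.0.0
    - PrometheusProto
```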
Do we need all of these? I'd prefer we avoid legacy versions.
expr: >-
  sum(instance_cpu:node_cpu_top:rate5m) without (mode, cpu)
  /
  sum(rate(node_cpu_seconds_total[5m])) without (mode, cpu)
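As a sketch, the expression above would sit inside a rule group in the ConfigMap-mounted rules file; the group and record names below are assumptions, only the expr comes from the diff:

```yaml
# Hypothetical rules-file fragment; group and record names are assumed,
# the expression is the one quoted in the diff above.
groups:
  - name: node-cpu
    rules:
      - record: instance:node_cpu_utilisation:ratio
        expr: >-
          sum(instance_cpu:node_cpu_top:rate5m) without (mode, cpu)
          /
          sum(rate(node_cpu_seconds_total[5m])) without (mode, cpu)
```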
What's the community source for these rules?
metric_relabel_configs:
  - source_labels: [__name__]
    regex: kube_replicaset_status_observed_generation
    action: drop
We must do service discovery using conventions and labels. Make sure ystack uses port names and current community standards for Prometheus discovery, then update the SD config so that it has no hard-coded targets. I'm fine with more than one SD config as long as it's clear how a pod in any namespace can match it. ServiceMonitor is also sometimes a use case, so we need an example of that in ystack.
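A minimal sketch of convention-based discovery using `kubernetes_sd_configs` with the widely used `prometheus.io/*` pod annotations; the job name and final label choices are assumptions, the `__meta_kubernetes_*` labels are standard Prometheus meta labels:

```yaml
# Sketch: any pod in any namespace can opt in by setting the
# prometheus.io/scrape annotation; no hard-coded targets.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that explicitly opt in.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Optional override of the metrics path.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Optional override of the scrape port.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

A second SD config keyed on container port names (e.g. a `metrics` port) could coexist with this one, as the comment above allows.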
Remove all monitoring.coreos.com CRDs (Prometheus, Alertmanager, ServiceMonitor, PodMonitor, PrometheusRule) and replace them with plain Kubernetes Deployments, ConfigMaps, and scrape config. Prometheus now uses kubernetes_sd_configs for target discovery instead of operator-managed ServiceMonitor/PodMonitor CRDs. Recording rules move from a PrometheusRule CRD to a ConfigMap-mounted rules file. A configmap-reload sidecar triggers /-/reload on changes.
Consolidates k3s/30-monitoring-operator + k3s/31-monitoring into a single k3s/30-monitoring base. Updates the converge and validate scripts accordingly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
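A sketch of what the reload sidecar might look like in the Prometheus Deployment's container list; the image tag, port, and mount names are assumptions:

```yaml
# Hypothetical sidecar: watches the mounted ConfigMap volume and POSTs
# to Prometheus's /-/reload endpoint when the files change.
- name: configmap-reload
  image: ghcr.io/jimmidyson/configmap-reload:v0.12.0   # assumed image/tag
  args:
    - --volume-dir=/etc/prometheus
    - --webhook-url=http://localhost:9090/-/reload
  volumeMounts:
    - name: config
      mountPath: /etc/prometheus
      readOnly: true
```

Note that `/-/reload` only works when Prometheus runs with `--web.enable-lifecycle`.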
v0.31.0 was not available in the container registry at experiment time. Revert this commit to restore v0.31.0 once it is published. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deploy Thanos Receive (StatefulSet) + Query (Deployment) and GreptimeDB standalone as competing remote_write backends for the metrics-v2 experiment. Prometheus sends scraped metrics to both via remote_write for side-by-side comparison. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
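The fan-out described above could look like the following in prometheus.yml; the Service names, ports, and GreptimeDB database name are assumptions (19291 is Thanos Receive's default remote-write port, 4000 is GreptimeDB's default HTTP port):

```yaml
# Sketch: one remote_write entry per competing backend, so both
# receive the same scraped samples for side-by-side comparison.
remote_write:
  - url: http://thanos-receive:19291/api/v1/receive
  - url: http://greptimedb:4000/v1/prometheus/write?db=public
```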
Thanos wins 8.35 vs 8.00 over GreptimeDB on weighted criteria: query correctness, operational complexity, resource usage, maturity, and storage cost projection. All PromQL queries returned consistent results across all three backends. Documents deviations from the original experiment plan. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both backends now write to versitygw object storage for storage cost comparison. Adds bucket-create jobs and S3 configuration for each. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
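On the Thanos side, the S3 wiring is typically an objstore config like the sketch below; the bucket name, endpoint, port, and credential placeholders are assumptions:

```yaml
# Hypothetical objstore.yml for Thanos pointing at in-cluster versitygw;
# credentials elided, endpoint/bucket assumed.
type: S3
config:
  bucket: thanos-metrics
  endpoint: versitygw:7070
  insecure: true
  access_key: <access-key>
  secret_key: <secret-key>
```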
WARNING comment included: these overrides should not be used in production. Forces frequent block cuts so S3 uploads are visible quickly during the metrics-v2 experiment. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
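A sketch of the experiment-only override on the Thanos Receive container; the block-duration flags are an assumption (they are hidden/testing flags in Thanos) and, per the warning, not for production:

```yaml
# Hypothetical args override: cut TSDB blocks frequently so uploads to
# S3 become visible within minutes instead of hours. Experiment only.
args:
  - receive
  - --tsdb.min-block-duration=15m
  - --tsdb.max-block-duration=15m
```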
Both backends now write to versitygw. GreptimeDB's columnar format produces 5.6x less data (252 KB vs 1.4 MB) for the same metrics workload. This flips the storage cost score and brings the weighted totals to a near-tie (Thanos 8.05 vs GreptimeDB 8.30). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 27a6b73 to 7e3d067.
for provisioners that use a fixed IP. Use y-k8s-ingress-hosts -check before attempting -write, so provision can complete without a TTY or sudo when entries already exist. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 7191acd to 77af594.
We can most likely meet our metrics-endpoint discovery needs using conventions and kubernetes_sd_configs.
While we're at it we should revisit long-term storage and querying.